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Abstract 

We study inference and learning based on a sparse coding model with 'spike-and-slab' prior. 
As standard sparse coding, the used model assumes independent latent sources that linearly 
combine to generate data points. However, instead of using a standard sparse prior such as a 
Laplace distribution, we study the application of a more flexible 'spike-and-slab' distribution 
which models the absence or presence of a source's contribution independently of its strength 
if it contributes. We investigate two approaches to optimize the parameters of spike-and-slab 
sparse coding: a novel truncated variational EM approach and, for comparison, an approach 
based on standard factored variational distributions. In applications to source separation we find 
that both approaches improve the state-of-the-art in a number of standard benchmarks, which 
argues for the use of 'spike-and-slab' priors for the corresponding data domains. Furthermore, 
we find that the truncated variational approach improves on the standard factored approach in 
source separation tasks - which hints to biases introduced by assuming posterior independence 
in the factored variational approach. Likewise, on a standard benchmark for image denoising, 
we find that the truncated variational approach improves on the factored variational approach. 
While the performance of the factored approach saturates with increasing number of hidden 
dimensions, the performance of the truncated approach improves the state-of-the-art for higher 
noise levels. 

Keywords: sparse coding, spike-and-slab distributions, approximate EM, variational Bayes, unsuper- 
vised learning, source separation, denoising 



1 Introduction 



Much attention has recently been devoted to studying sparse coding models with 'spike-and-slab' 



distribution as a prior over the latent variables Goodfellow et al. 2012 , Mohamed et al. , 2012 , Liicke 



and Sheikh, 2012 Titsias and Lazaro-Gredilla 2011, Carbonetto and Stephen 2011, Knowles and 



Ghahramani, 2011 Yoshida and West , 20101. In general, a 'spike-and-slab' distribution is comprised 



of a binary (the 'spike') and a continuous (the 'slab') part. The distribution generates a random 
variable by multiplying together the two parts such that the resulting value is either exatly zero (due 
to the binary random variable being zero) or it is a value drawn from a distribution governing the 
continuous part. In sparse coding models, employing spike-and-slab as a prior allows for modeling 
the presence or absence of latents independently of their contributions in generating an observation. 
For example, piano keys (as latent variables) are either pressed or not (binary part), and if they 
are pressed, they result in sounds with different intensities (continuous part). Note that the sounds 
generated by a piano are also sparse in the sense that of all keys only a relatively small number is 
pressed on average. 

Spike-and-slab distributions can flexibly model an array of sparse distributions, lending them 
desirable for many types of data. Algorithms based on spike-and-slab distributions have successfully 
been used, e.g., for transfer learning |Goodfellow et al~ 2012| , regression |West 2003, Carvalho 



et al. 2008 Carbonetto and Stephen, 2011 Titsias and Lazaro-Gredilla, 2011 , denoising [Zhou 



et al. 2009| Titsias and Lazaro-Gredilla, 2011 , and often represent the state-of-the-art on given 



benchmarks [compare Titsias and Lazaro-Gredilla 2011, Goodfellow et al. 2012 



The general challenge with spike-and-slab sparse coding models lies in the optimization of the 
model parameters. Whereas the standard Laplacian prior used for sparse coding results in uni- 
modal posterior distributions, the spike-and-slab prior results in multi- modal posteriors [see, e.g., 
Titsias and Lazaro-Gredilla, 2011, Liicke and Sheikh, 2012| . Fig.[T]shows typical posterior distribu- 



tions for spike-and-slab sparse coding (the model will be formally defined in the next section). The 
panels illustrate the posteriors for the case of a two-dimensional observed and a two-dimensional 
hidden space. As can be observed, the posteriors have multiple modes and the number modes 



increases exponentially with the dimensionality of the hidden space Titsias and Lazaro-Gredilla 



2011, Liicke and Sheikh, 20121. The multi-modal structure of the posteriors argues against the 



application of the standard maximum a-posteriori (MAP) approaches Mairal et al. 2009, Lee 



et al. 2007, Olshausen and Field, 1997 or Gaussian approximations of the posterior Seeger, 2008 



Ribeiro and Opper, 2011 because they rely on uni-modal posteriors. The approaches that have 



been proposed in the literature are, consequently, MCMC based methods [e.g., Carvalho et al. 
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Figure 1: Left panels visualize observations generated by two different instantiations of the spike- 
and-slab sparse coding model ([I]) to Solid lines are the generating bases vectors. Right 



panels illustrate the corresponding exact posteriors over latents computed using (16) and (19) 
given observations and generating model parameters. Note that the probability mass seen just 
along the axes or around the origin actually lies exactly on the axis. Here we have spread the mass 
for visualization purposes by slightly augmenting zero diagonal entries of the posterior covariance 



matrix in (19). 



2008, Zhou et al. 2009 Mohamed et al. 2012 and variational EM methodologies [e.g., Zhou et al. 



2009, Titsias and Lazaro-Gredilla, 2011, Goodfellow et al., 2012 . While MCMC approaches are 



more general and more accurate given sufficient computational resources, variational approaches 
are usually more efficient. Especially in high dimensional spaces, the multi-modality of the pos- 
teriors is a particular challenge for MCMC approaches; consequently, recent applications to large 



hidden spaces have been based on variational EM optimization [Titsias and Lazaro-Gredilla, 2011 



Goodfellow et al. , 2012 . The variational approaches applied to spike- and-slab models thus far [see 



Yoshida and West] |2010[ [Rattray et al.[|2009] |Goodfellow et al] |2012[ [Titsias and Lazaro-Gredilla 



2011 assume a factorization of the posteriors over the latent dimensions, i.e., the hidden dimensions 



are assumed to be independent a-posteriori. This means that any dependencies, e.g. explaining- 
away effects including correlations (compare Fig. [I]) are ignored and not accounted for. But what 
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consequences does such a neglection of posterior structure have? Does it result in biased parameter 
estimates and is it relevant for practical tasks? Biases induced by factored variational inference in 
latent variable models have indeed been observed before |MacKay 2001, Ilin and Valpola, 2003 



Turner and Sahani 20111. For instance, in source separation tasks, optimization through factored 



inference can be biased towards finding mixing matrices that represent orthogonal sparse direc- 
tions, because such solutions are most consistent with the assumed a-posteriori independence [see 



Ilin and Valpola 2003, for a detailed discussion]. Therefore, the posterior independence assumption 



in general may result in suboptimal solutions. 

In this work we study a variational EM approach for spike-and-slab sparse coding which does 
not assume a-posteriori independence while being able to model multiple modes. Instead of using 
factored distributions or Gaussians, the novel approach is based on posterior distributions truncated 



to regions of high probability mass |Liicke and Eggert , 2010| - an approach which has recently been 



applied to different models [see e.g., Puertas et al. 2010 Shelton et al. 2011, Dai and Liicke 



2012 1 . In contrast to the previously studied factored approaches |Titsias and Lazaro-Gredilla 



2011, Goodfellow et al. 2012, Mohamed et al. 2012 , the truncated approach will furthermore 



take advantage of the fact that in the case of a Gaussian slab and Gaussian noise model, integrals 



over the continuous latents can be obtained in closed- form [Liicke and Sheikh 2012| . This implies 
that the posteriors over latent space can be computed exactly if the sums over the binary part 
are exhaustively evaluated over expotentially many states. This enumeration of the binary part 
becomes computationally intractable for high-dimensional hidden spaces. However, by applying 
the truncated variational distribution exclusively to the binary part of the hidden space, we can 
still fully benefit from the analytical tractability of the continuous integrals. 

In this work, we systematically compare the truncated approach to a recently suggested fac- 



tored variational approach Titsias and Lazaro-Gredilla, 2011 . A direct comparison of the two 
variational approaches will allow for answering the questions about potential drawbacks and biases 
of both optimization procedures. As approaches assuming factored variational approximations have 



recently shown state-of-the-art performances Titsias and Lazaro-Gredilla 2011, Goodfellow et al. 



2012 , understanding their strengths and weaknesses is crucial for further advancements of sparse 



coding approaches and their many applications. Comparison with other approaches that are not 
necessarily based on the spike-and-slab model will allow for accessing the potential advantages of 
the spike-and-slab model itself. 

In section [2] we will introduce the used spike-and-slab sparse coding generative model, and 
briefly discuss the factored variational approach which has recently been applied for parameter 
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optimization. In section [3] we derive the closed-form EM parameter update equations for the 
introduced spike-and-slab model. Based on these equations, in section [4] we derive the truncated 
variational EM algorithm for efficient learning in high dimensions. In section [5j we numerically 
evaluate the algorithm and compare it to factored variational and other approaches. Finally, in 
section [6] we discuss the results, and present details of the derivations and experiments in the 
Appendix. 



2 Spike-and-slab Sparse Coding 



The spike-and-slab sparse coding model assumes like standard sparse coding a linear superposition 
of basis functions, independent latents, and Gaussian observation noise. The main difference is that 
a spike-and-slab distribution is used as a prior. Spike-and-slab distributions have long been used 



for different models [e.g., Mitchell and Beauchamp, 1988, among many others] and also variants 



of sparse coding with spike-and-slab priors have been studied previously [compare West, 2003 



Garrigues and Qlshausen[ |2007[ |Knowles and Ghahramanif |2007[ |Teh et aL| |2007[ |Carvalho et al.[ 



2008 , Paisley and Carin , 2009 , Zhou et al. , 2009 . In this work we study a generalization of the spike- 



and-slab sparse coding model studied by Liicke and Sheikh [2012 . The data generation process 
in the model assumes an independent Bernoulli prior for each component of the the binary latent 
vector s G {0, 1}^ and a multivariate Gaussian prior for the continuous latent vector z G M. H : 

H 



p(s\Q) = £(s;7f) = n< h (l 

h=l 

p (z\G) = N&ji,*) t 



(1) 

(2) 



where irh defines the probability of Sh being equal to one and where fl and ^ parameterize the 
mean and covariance of z respectively. The parameters fl G and ^ G M> HxH parameterizing 
the Gaussian slab in ^ generalize the spike-and-slab model used in |Liicke and Sheik h, 2012 . 
The two latent variables s and z are combined via point-wise multiplication (s z)^ = s^z^. 
The resulting hidden variable (sQz) follows a 'spike-and-slab' distribution, i.e., the variable entries 
have continuous values or are exactly zero. Given such a latent vector, a D-dimensional observation 
y £ M. D is generated by linearly superimposing a set of basis functions W and adding Gaussian 
noise: 

p(y\ s, z, 9) = M{y; W(sQ z), e), (3) 

where each column of the matrix W G M. DxH is a basis function W = (w\, . . . ,wh) and where 
the matrix £ G M DxD parameterizes the observation noise. We use = (W, S, fr, /x, to denote 
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all the model parameters. Note that having a spike-and-slab prior implies that for high levels of 
sparsity (i.e., low values of ir^) the latents assume exact zeros with high probability. This is an 
important distinction compared to the Laplace or Cauchy distributions used for standard sparse 
coding |01shausen and Field 1997 . 



The spike-and-slab sparse coding algorithm we derive in this work is based on the model ([!]) 
to ([3]). The factored variational approach |Titsias and Lazaro-Gredilla 2011 that we use for later 
comparison is based on a similar model, which we will refer to from now on as the MTMKL model. 
The MTMKL model is both a constrained and generalized version of the model we study. On one 
hand, it is more constrained by assuming the same sparsity for each latent, i.e., 7Vh = ^h' (f° r an 
h, h'); and by using a diagonal covariance matrix for the observation noise, £ = diag(crf, . . . ,&£))■ 
On the other hand, it is a generalization by drawing the basis functions W from Gaussian processes. 
The model ^ to (g) can then be recovered as a special case of the MTMKL model if the Gaussian 
processes are Dirac delta functions. For parameter optimization, the MTMKL model uses a stan- 
dard factored variational optimization. In the case of spike-and-slab models, this factored approach 
means that the exact posterior p(s,z\ y) is approximated by a variational distribution q n (s, z; O) 
that assumes the combined latents to be independent a-posteriori |Titsias and Lazaro-Gredilla[ 



2011, Goodfellow et al. 2012 



H 



q n (s, z; G) = JJ q { n h) (s h , z h ; 0), 



(4) 



h=l 



where are distributions only depending on and Zh and not on any of the other latents. A 



detailed account of the MTMKL optimization algorithm is given by Titsias and Lazaro-Gredilla 



[2011] and for the later numerical experiments, we use the source code provided along with that 
publication^ 



3 Expectation Maximization (EM) for Parameter Optimization 

In order to learn the model parameters O given a set of N independent data points {y ( n '}n=l ... JV 
with y( n ^ € M. D , we maximize the data likelihood C = Yln=i p(v^ I ®) by applying the Expectation 
Maximization (EM) algorithm. Instead of directly maximizing the likelihood, the EM algorithm 



[in the form studied by Neal and Hinton, 1998 maximizes the free-energy, a lower bound of the 
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we downloaded the code from http://www.well.ox.ac.uk/~mtitsias/code/varSparseCode.tar.gz 
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log-likelihood given by: 

N 



T(& old , 0) = E {^P(y (n) , s, A ©)) + #(0 old ), (5) 

n=l 

where ( • ) n denotes the expectation under the posterior over the latents s and z given y ^ 

(/& ^)) n = E / p& *\ y (n) > 0Old ) ^ di * ( 6 ) 

and H(Q old ) = — Es /z p(s, z\ y^ n \ G old ) log(p(s, z\ y^ n \ old )) dz is the Shannon entropy, which 
only depends on parameter values held fixed during the optimization of T w.r.t. in the M-step. 
Note that is a summation over all possible binary vectors s. 

The EM algorithm iteratively optimizes the free-energy by alternating between two steps. First, 
in the E-step given the current parameters old , the relevant expectation values under the posterior 
p(s, z\ y^ n \ old ) are computed. Next, the M-step uses these posterior expectations and maximizes 
the free-energy J-"(0 old ,0) w.r.t. 0. Iteratively applying E- and M-steps locally maximizes the 
data likelihood. In the following section we will first derive the M-step equations which themselves 
will require expectation values over the posteriors (Eqn.[6j. The required expressions for these 
expectations (the E-step) will be derived afterwards. 

3.1 M-step Parameter Updates 

The M-step parameter updates of the model are canonically obtained by setting the derivatives 
of the free-energy §5§ w.r.t. the second argument to zero. Details of the derivations are given in 
Appendix [A] and the resulting update equations are as follows: 

w = ^i* (w) <*o*£ (7) 
E£a<(*©^(*0 2) T ) B 
1 - 



7T 



JV 

n=l 
N 



S = Jf E [y (n) (y {n) ) T ~ 2 (W(sQ z) n )(y {n) f + W({(sQ z)(sQ .1 T > J^ T ] , (9) 



n=l 



z 



P = V/" (10) 

Ln=l { S ) n 



Eti <(*© z Wo?) T ) n -w T Ell <^> n 



and * = ^" =1XV - " ~ N ' ^ ^" =1X ^. (11) 



7 



3.2 E-step Expectation Values 

The M-step equations ([7]) to (11) require expectation values w.r.t. the posterior distribution be 
computed over the whole latent space which requires either analytical solutions or approximations 
of integrals over the latent space. For the derivation of closed- form E-step equations it is useful to 
note that the discrete latent variable s can be combined with the basis function matrix W so that 
we can rewrite ^ as 

p(y\ s, z, 0) = N(y; Wgz, £), (12) 

where we have defined (Wg)dh = W^Sh such that W(s z) = Wgz. 

Now first note that the marginal data likelihood p(y\ 0) can be derived in closed- form after 
marginalizing the joint p(y, s,z\Q) over z: 

p{y, s\ 0) = B(s; if) J Af(y; Wgz, S) Af(z; fl, *) dz (13) 
= B(s;if)M(y;W g fl,C g ), (14) 

where Cg = S + Wg^frWT. The second step follows from standard identities for Gaussian random 



variables [e.g., Bishop, 2006 . We can then sum the resulting expression over s to obtain 

P(V\ ©) = ??) AA(y; Wgp, Cg). (15) 

s 

Thus, the marginal distribution takes the form of a Gaussian mixture model with 2 H mixture 
components indexed by s. However, unlike in a standard Gaussian mixture model, note that the 
mixing proportions and the parameters of the mixture components are not independent but coupled 
together. Therefore, the following steps will lead to closed-form EM updates that are notably not 
a consequence of closed- form EM for classical Gaussian mixtures. In contrast, Gaussian mixture 
models assume independent mixing proportions and independent component parameters. By using 



Eqns. 14 and 15 the posterior over the binary latents p(s | y, 0) is given by: 



v(g \ ce) = p&y\Q) = EMI M{mWgji,Cg) 

PK iy > p(y\9) ^B(s';if)Af(y;Wgfl,Cgy 

We can now consider the factorization of the posterior p(s, z\y,Q) into the posterior over the binary 
part p(s\ y, 0) and the posterior over the continuous part given the binary state p(z\ s, y, 0): 

p(s,z\y,@) = p(s\y,@)p(z\s,y,e). (17) 
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Like the first factor in (17), the second factor is also analytically tractable and given by: 

_, Q v = p(s\e)p(z\e)p(y\z,s,@) 
' p(s | G) Jp(y | z, s, Q)p(z | Q)dz 

KN&p,V)N(y-,W g z,Z) 

= M(z-,K g , Ag), 

where the last step again follows from standard Gaussian identities with definitions 

A ? = (WjZ~ 1 W g + VSC 1 )- 1 , 
4° = (*© M) + A sWj S" 1 (y W - Wgp) 
and with 'J/^ = ^(^diag(s)^ . The full posterior distribution can thus be written as 



(18) 



(19) 



p(s,z\y^,0) 



B(s; tt) AOgW ; CV) A/~(i; 4 n) , A 



(20) 



Equation 20 represents the crucial result for the computation of the E-step below because, first, it 
shows that the posterior does not involve analytically intractable integrals and, second, for fixed 
s and y ^ the dependency on z follows a Gaussian distribution. This special form allows for the 
derivation of analytical expressions for the expectation values as required for the M-step updates. 
Because of the Gaussian form the integrations over the continuous part are straight-forward and 
the resulting expressions are given by: 



<*>» 

and ((sQz){sQz) T ) n 



^2q n (s;Q) ss T , 

i 

J2ln(s;@) (A,+ 4%^) T ). 



(21) 
(22) 

(23) 
(24) 



Note that the left-hand-sides of the expressions above are expectation values over the full latent 
space w.r.t. the posterior p(s, z\ y^ n \ 0), whereas the right-hand-sides now take the form of expec- 
tation values only over the binary part w.r.t. the posterior p(s\ y^ n \ 0) in Eqn. 
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The derivations 



of E-step equations (21 ) to (24) are a generalization of the derivations by Liicke and Sheikh (2012 



While Gaussian identities and marginalizations have been used to obtain analytical results for 



mixture-of-Gaussian priors before [e.g. Attias, 1999, Garrigues and Olshausen, 2007, Olshausen 



and Millman, 20001 , the above equations are the first closed-form solutions for the spike-and-slab 



model [first appearing in Liicke and Sheikh, 2012|. The observation that the Gaussian slab and 



Gaussian noise model allows for analytically tractable integrals has, in parallel work, also been 



noted and used by Mohamed et al. ,2012 



Iteratively computing the E-step equations using the current parameters for equations (21) 



to (24) and the M-step equations ([7]) to (11), represents a closed- form and exact EM algorithm 
which increases the data likelihood of the model to (possibly local) maxima. 



4 Truncated Variational EM 



While being exact, the execution of the above EM algorithm results in considerable computational 
costs for larger-scale problems. Without approximations, the computational resources required 
scale exponentially with the number of hidden dimensions H. This can be seen by considering 
the expected values w.r.t. the posterior p(s\ y, Q) above, which each require a summation over all 
binary vectors s£ {0,1}^. For tasks involving low dimensional hidden spaces the exact algorithm 
is still applicable. However, for higher dimensional problems approximations are required. Still, 
we can make use of the closed-form EM solutions by applying an approximation solely to the bi- 
nary part. Instead of sampling-based or factored approximations to the posterior p(s,z\ y, 0), we 
use a truncated variational approximation to the posterior p(s\ y^ n \ 0) in Eqn. 16. The truncated 



approximation is defined to be proportional to the true posteriors on subspaces of the latent space 



with high probability mass [compare Liicke and Eggert 2010, Puertas et al. 2010| . More con- 
cretely, a posterior distribution p{s \ y^ n \ 0) is approximated by a distribution q n (s; 0) that only 
has support on a subset K, n C {0, 1}^ of the state space: 

p(s,y^ |0) 



i(«;e) 



^ p{s\y {n) | 0) 



5(se K n ), 



(25) 



where 5(s S K, n ) is an indicator function, i.e., 5(s G /C n ) = 1 if s € K. n and zero otherwise. Note 



that the definition of q n (s; 0) in (25) neither assumes uni-modality like MAP Mairal et al. 2009 



Lee et al. 2007, Olshausen and Field, 1997 or Gaussian approximations of the posterior |Ribeiro 



and Opper, 2011, Seeger, 2008 , nor does it assume a-posteriori independence of the latents as 



factored approximations |Jordan et al. 1999 Goodfellow et al. 2012 Titsias and Lazaro-Gredilla 



2011 . Moreover, the approximation can inherently exploit the fact that the continous part of the 



latents can be integrated out. The basic assumption behind the approximation in (25) is that the 
posterior mass is sufficiently concentrated in IC n - an assumption that in general holds for data 
that are assumed to be generated by sparse latent causes. 
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It is shown by Liicke and Eggert |2010 that for appropriately defined subspaces K, n the KL- 
divergence between the true posteriors and their truncated approximations converges to values close 
to zero. For our approach the subsets fC n are defined based on index sets /„ C {1, . . . , H} which 
contain the H' most relevant dimensions for the corresponding data points y( n ); 



K-n = {s | Ylh s h<l and Mh I n : s h = 0} U U, 



(26) 



where /„ contains those H' hidden dimensions with highest values of a selection (or scoring) function 
S h {y^ n \Q) (which we define later). The set U is defined as U = {s \ Sh = 1} and ensures that 



)C n contains all singleton states [compare Liicke and Eggert, 2010| . Otherwise, KL n only contains 
vectors with at most 7 non-zero entries and with non-zero entries only permitted for h £ I n . By 
choosing an appropriate index set I n for each data point, Eqn. [25] results in a good approximation 
to the posterior. In our case of the sparse spike- and-slab model of Eqns.[T] to [3] the truncated 
approximate posterior is given by: 



,(s;e) 



(27) 



To obtain some intuition for the approximation in the case of spike-and-slab sparse coding note that 
by far most generated data points lie close to low-dimensional hyper-planes spanned by the models' 
basis functions. The posterior mass given such data points is consequently concentrated on low- 
dimensional hyper-planes spanned by a small number of hidden dimensions (compare Fig.[TJ upper- 
right). The posterior mass in higher-dimensional subspaces is negligible. If the hidden dimensions 



in I n are appropriately selected, the variational approximation (27) approximates the true posterior 
with high accuracy by matching its distributions on the low-dimensional hyper-planes. To ensure 
high accuracy, the dimensionality of these hyper-planes (approximation parameter 7) will be chosen 
relatively high compared to the average number of contributing hidden dimensions. 



The truncated approximation (27) can now be used to compute the expectation values which 



are required for the M-step equations. If we use the variational distributions in Eqn. 27 for q n (s; 0) 



on the right-hand-sides of Eqns. 21 to |24[ we obtain: 

^(y {n) ; w 3 ji, Cg) m g m 



(28) 



where f(s) denotes any of the (possibly parameter dependent) functions of (21) to (24). Instead of 
having to compute sums over the entire binary state space with 2 H states, only sums over subsets 
JC n have to be computed. Since for many applications the posterior mass is finally concentrated in 
small volumes of the state space, the approximation quality can stay high even for relatively small 
sets K, n . 
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4.1 Computational Complexity 



The truncated E-step denned by (21 ) to (24) with (28) scales with the approximation parameters 7 



and H 1 which can be defined independently of the latent dimensionality H. The complexity scales 
as O (N Yly=o (y ) ifl + 7') 3 ) > where the -D 3 term can be dropped from the cubic expansion if the 
observed noise S is considered to be diagonal or homoscedastic (i.e., £ = a 2 1). Also note that the 



truncated approximation yields sparse matrices in Eqns. 25 and [27] which results in more efficient 
and tractable matrix operations. 

Although the total number of data points N above defines a theoretical upper bound, in practice 
we can further benefit from the preselection step of the truncated approach to achieve significantly 
improved runtime performance. Clustering the data points using the index sets I n saves us from 



redundantly performing various computationally expensive operations involved in Eqns. 19 and[27| 



that given a state s G KL n are independent of individual data points sharing the same subspace 
K, n . Furthermore, such a batch processing strategy is also readily parallelizable as the truncated 
E-step can be performed independently for individual data clusters (see Appendix [C| for details). 
Using the batch execution mode we have observed an average runtime speedup of up to an order 
of magnitude. 



4.2 Selection function 

To compose appropriate subsets JC n a selection function Sh(y^ n \&) is defined, which prior to 
each E-step selects the relevant hidden dimensions (i.e., the elements of the index set I n ) that 
are most likely to have contributed to the data point y( n \ Selection functions can be based on 



upper-bounds for probabilities p(sh = 1 1 y^ n \ 0) [compare Liicke and Eggert 



2010 



Puertas et al. 



2010] or deterministic functions such as the scalar product between a basis vector and a data point 



[derived from noiseless limits applied to observed space; compare Liicke and Eggert 2010 



For the sparse coding model under consideration we define a selection function as follows: 



i?(»)l3> = ?, 



4,0), 



(29) 



where Sh represents a singleton state in which only the entry h is non-zero. The selection function 



(29) is basically the data likelihood given a singleton state %. The function does not take into 



account the probability of the state itself (i.e., p(sh | ©)), as this may introduce a bias against less 
active latent dimensions. The accuracy of the selection function will directly be evaluated in the 
next section. 
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Equations (25) to (27) replace the computation of the expectation values w.r.t. the exact poste- 
rior, and represent the approximate EM algorithm used in the experiments section. The algorithm 
will be applied without any further mechanisms such as annealing as we found it to be very ro- 
bust in the form derived above. Furthermore, no data preprocessing such as mean subtraction or 
variance normalization will be used in any of the experiments. To distinguish the algorithm from 
others in comparative experiments, we will refer to it as Gaussian Sparse Coding (GSC) algorithm 
in order to emphasize the special Gaussian case of the spike-and-slab model used. 

5 Numerical Experiments 

We investigate the performance of the GSC algorithm on artificial data as well as various realistic 
source separation and denoising benchmarks. For all experiments the algorithm was implemented 



to run on arrays of CPU nodes as described in Bornschein et al. 2010 . We extended the technique 



to make our implementation more efficient and suitable for parallelization by making use of the 



observation discussed in section 4.1 and Appendix [Cj In all the experiments, the GSC parameters 
were randomly initialized^] The choice of GSC truncation parameters H' and 7 is in general 
straight-forward: the larger they are the closer the match to exact EM but the higher are also the 
computational costs. The paramters are hence capped by the available computational resources. 
However, empirically we observed that often much smaller than maximally affordable values were 
sufficient Note that factored variational approaches do usually not allow to adjust the accuracy 
/ complexity trade-off using a simple parameter change. 

5.1 Reliability of the Selection Function 

To assess the reliability of the selection function we perform a number of experiments on small scale 



artificial data generated by the model, such that we can compute both the exact ( 16 ) and truncated 



(27) posteriors. To control for the quality of the truncated posterior approximation - and thus the 



selection function - we compute the ratio between posterior mass within the truncated space fC r 



2 

We randomly and uniformly initialized the between 0.05 and 0.95. jl was initialized with normally distributed 
random values and the diagonal of \t was initialized with strictly positive uniformly distributed random values. We 
set E to the covariance accross the data points, and the elements of W were independently drawn from a normal 
distribution with zero mean and unit variance. 

Q I I 

Compare Appendix B for trade-off between complexity and accuracy of the truncated approach 
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and the overall posterior mass [compare Liicke and Eggert 2010] : 

Q {n) 



Y, 1 ?B{s>;Ti)N{yW;W 1? fl,C 1 ») 



(30) 



where the integrals over the latent z in (30) are again given in closed- form. Note that ranges 



from zero to one and that it is directly related [see Liicke and Eggert, 2010 to the KL-divergence 



between the approximation q n in Eqn. 27 and the true posterior: 

D KL (q n (s,z; @),p(s,z\ ^,0)) = -\og(Q^) 



(31) 



If Q^ n ' is close to one, the KL-divergence is close to zero. 

Data for the control experiments were generated by linearly superimposing basis functions that 
take the form of horizontal and vertical bars [see e.g., Foldiak 1990, Hoyer, 2002 on a D = D2 XD2 



pixel grid, where D2 = H/2. This gives us D2 possible horizontal as well as vertical locations for 
bars of length D2, which together form our generating bases VF 86 ". Each bar is then randomly 
assigned either a positive or negative intensity of 10. We set the sparsity such that there are on 
average two active bars per data point i.e., 7r| en = 2/H for all h 6 H. We assume homoscedastic 
observed noise E gen = a 2 Id, where a 2 = 2. The mean of the generating slab is i.i.d. drawn from a 
Gaussian: fl gen ~ Af(0, 5), and the covariance of the slab is fy gen = I H . We generate N = 1000 data 
points. We run experiments with different sets of values for the truncation parameters (i?' ,7) G 
{(4,4), (5,4), (5,3)} for each H £ {10, 12}. Each run consisted of 50 EM iterations and after each 
run we computed the Q-value over all the data points. For all the experiments we find the average 



Q-values to be above 0.99, which shows that the state subspaces (26) constructed from the H' 



latents chosen through the selection function (29) contain almost the entire posterior probability 
mass in this case. 



5.2 Recovery of Sparse Directions on Synthetic Data 

First we assess the performances both the GSC approach (using the truncated variational approx- 
imation) and MTMKL approach (which uses a factored variational approximation) on synthetic 
data generated by standard sparse coding models. In one set of experiments we generate data using 



sparse coding with Cauchy prior Olshausen and Field, 1996), and in a second set of experiments 



we use sparse coding with a Laplace prior Olshausen and Field , 1996 Lee et al. , 2007 . For each 



trail in an experiment a new mixing matrix W geD was generated without any constraints on the 
sparse directions (i.e., matrices were non-orthogonal in general). In both sets of experiments we 
simultaneously vary both the observed and latent dimensions D and H between 20 and 100, and 



14 



0.3 



0.2 



0.1 



20 



-GSC 

- MTMKL 



-I — 
4- 



40 



60 



Cauchy 



-¥ 1 



Laplace 



80 100 20 40 60 80 100 

H = D 



Figure 2: Performance of GSC (with H ' = 7 = ^) vs. MTMKL on data generated by standard 
sparse coding models both with Cauchy and Laplace priors. Performance compared on the Amari 



index (32) 



repeat 15 trials per given dimensionality. For each trail we randomly generated a new dataset with 
N = 5000. Per trial, we perform 50 iterations of both algorithms. The GSC truncation parame- 
ters H' and 7 were set to We assess the performances of the algorithms in terms of how well 
the bases W inferred by each of them align with the ground truth bases P^ gen . As a measure of 



discrepancy between the generating and the recovered bases we use the Amari index | Amari et al. 



1996 



A(W) 



2H H 



1 H 

' h,h'= 



\o 



hi, 1 



+ 



10 



hh' 



Vmax^/ \Ohh"\ mayL h t,\O h 



H-V 



(32) 



where Oh 



W 1 W sen ) Note that the Amari index is either positive or zero. It is zero only 



'hh - \" " ) hh 

when the basis vectors of W and W gen represent the same set of orientations, which in our case 
implies a precise recovery of the (ground truth) sparse directions. 

The results for GSC and MTMKL in Fig. [2] show that both approaches do relatively well in 
recovering the sparse directions, which shows that they are robust against the model mismatch 
imposed by generating from models with other priors. Furthermore, we observe that the GSC 
approach consistently recovers the sparse directions more accurately. 



5.3 Source Separation 

On synthetic data we have seen that spike-and-slab sparse coding can effectively recover sparse 
directions such as those generated by standard sparse coding models. As many signals such as 
acoustic speech data are sparse, and as different sources mix linearly, the assumptions of sparse 
coding match such data well. Source separation is consequently a natural application domain of 
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sparse coding approaches, and well suited for benchmarking novel spike-and-slab as well as other 
sparse coding algorithms. To systematically study the a-posteriori independence assumption in 
factored variational approaches, we monitor the recovery of sparse directions of GSC and MTMKL 
for an increasing degree of the mixing matrix's non-orthogonality. Fig. [3] shows the performance of 
both the methods based on three different source separation benchmarks obtained from [ICALAB 



Cichocki et al. , 2007 . The error bars are generated by repeating 15 trials per experiment. The 
x-axis in the figure represents the degree of orthogonality of the ground truth mixing bases W gen . 
Starting from strictly orthogonal at the left, the bases were made increasingly non-orthogonal 
by randomly generating orthogonal bases and adding Gaussian distributed noise to them with 
a = (4, 10, 20) respectively. For Fig. [3] no observation noise was added to the mixed sources. For 
both the algorithms we performed 100 iterations per rurj^J The GSC truncation parameters H' 
and 7 were set to 10 for all the following experiments, therefore for lOhalo the GSC inference was 
exact. As can be observed, both approaches recover the sparse directions well. While performance 
on the EEG19 dataset is the same, GSC consistently performs better than MTMKL on lOhalo and 
Speech20. If observation noise is added, the difference can become still more pronounced for some 
datasets. Fig. [4] shows the performance in the case of Speech20 (with added Gaussian noise with 
a = 2.0), for instance. Along the x-axis orthogonality increases, again. While the performance 
of MTMKL decreases with decreasing orthogonality, performance of GSC increases in this case. 
For other datasets increased observation noise may not have such effects, however (see Appendix, 



Fig. 10 for two examples). 

Next we look at MAP based sparse coding algorithms for the source separation task. Publicly 



available methods which we compare with are the lasso based least angle regression [SPAMS; Mairal 



et al. , 2009 and the efficient sparse coding algorithms [ESCA; Lee et al. , 2007] with epsilonLi- 

benchmarks 



penalt}j^j The performance is tested on another set of ICALAB 



Cichocki et al. 



2007 



used previously Suzuki and Sugiyama, 2011, Liicke and Sheikh, 2012 . Following Suzuki and 



Sugiyama 2011 we use N = 200 and N = 500 data points from each benchmark and generate 



observed data by mixing the benchmark sources with randomly generated orthogonal bases and 
adding no noise to the observed data. For each experiment we performed 50 trials with a new 
randomly generated orthogonal data mixing matrix W gen and new parameter initialization in each 
trial. The GSC inference was exact for these experiments with better results obtained with observed 
4 



For the MTMKL algorithm we observed convergence after 100 iterations while the GSC algorithm continued to 
rove with more iterations. However, the reported results are obtained with 100 iterations. 
For both the algorithms, optimal values for regularization parameters were chosen through cross-validation. 
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Figure 3: Performance of GSC vs. MTMKL on source separation benchmarks with varying degrees 
of orthogonality of the mixing bases. The orthogonality on the x-axis varies from being orthogonal 
_L to increasingly non-orthogonal mixing as randomly generated orthogonal bases are perturbed by 
adding Gaussian noise M(0, a) to them. No observation noise was assumed for these experiments. 



Performances are compared on the Amari index ( 32 ) . 
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Figure 4: Performance of GSC vs. MTMKL in terms of the Amari index (32) on the Speech20 



benchmark with varying degrees of orthogonality of the mixing bases and Gaussian noise (with 
a = 2) added to observed data. The orthogonality on the x-axis varies from being orthogonal _L 
to increasingly non-orthogonal mixing as randomly generated orthogonal bases are perturbed by 
adding Gaussian noise A/"(0, a) to them. 
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datasets 


Amari index - mean (std.) 


name 


H = D 


N 


GSC MTMKL SPAMS ESCA 


lOhalo 


10 


200 
500 


0.27(.04) 0.21(.05) 0.28(0) 0.31(.02) 
0.17(.03) 0.20(.03) 0.29(0) 0.29(.02) 


Sergio7 


7 


200 
500 


0.19(.05) 0.19(.03) 0.18(0) 0.27(.04) 
0.13(.04) 0.23(.04) 0.19(0) 0.18(.04) 


Speech4 


4 


200 
500 


0.13(.04) 0.14(.03) 0.18(0) 0.23(.02) 
0.10(.04) 0.14(.08) 0.16(0) 0.17(0) 


c5signals 


5 


200 
500 


0.29(.08) 0.24(.08) 0.39(0) 0.47(.05) 
0.31(.06) 0.32(.03) 0.42(0) 0.48(.05) 



Table 1: Performance of GSC, MTMKL and other publicly available sparse coding algorithms on 



benchmarks for source separation. Performances are compared based on the Amari index (32). 
Bold values highlight the best performing algorithm(s). 



noise constrained to be homoscedasticj^J We performed up to 350 iterations of the GSC algorithm 
(with more iterations continuing to improve the performance) while for the other algorithms we 
observed convergence between 100 and 300 iterations. 

Tab. [T] lists the performances of the algorithms. As can be observed, the spike-and-slab based 
models perform better than the standard sparse coding models for all except of one experiment 
(Sergio7, 200 data points) where SPAMS performs comparable well (or slightly better). Among the 
spike-and-slab models, GSC performs best for all settings with 500 data points, while MTMKL is 
better in two cases for 200 data pointfl Note that further improvements on some settings in Tab.[l| 
can be obtained by algorithms constrained to assume orthogonal bases |Suzuki and Sugiyama 



Liicke and Sheikh, 2012 . However, for lOhalo and speech^ GSC and MTMKL are better without 



such an explicit constraint. 



6 
7 T 



To infer homoscedastic noise we set in the M-step the updated noise matrix £ to a 2 In where a 2 = Tr(£)/ZX 
By considering Tab. [l] performance not necessarily increases with an increased number of data points. Note in 



this respect, however, that the used data points are not independent samples. Following Suzuki and Sugiyama 201 1| 
we always took consecutive 200 or 500 data points (after an offset) from each of the benchmarks. Therefore, due to 
time-dependencies in the signals, the underlying data point statistics change with the number of data points. 
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5.4 Computational Complexity vs. Performance 

In terms of complexity, GSC and MTMKL algorithms are significantly different, so we also looked 
at the trade-off between their computational costs versus performance. Figj5] shows performance 
against compute time for both algorithms. The error bars for the Speech20 plot were generated 
from 15 trials per experiment. For MTMKL we obtained the plot by increasing the number of 
iterations from 50 to 100 and 1000, while for the GSC plot we performed 100 iterations with 
H' = 7 G [2,3,5,7,10]. For the image denoising task (described next), the MTMKL plot was 
generated from a run with H = 64 latent^] and the number of iterations going up to 12K. The 
GSC plot was generated from H = 400 latents and H' and 7 being 10 and 5 respectively. Up to 
120 iterations were performed to obtain the last point for the GSC plot. As can be observed for 
both tasks, the performance of MTMKL saturates from certain runtime values onwards. GSC on 
the other hand continues to show improved performance with increasing computational resources. 

5.5 Image Denoising 

Finally, we investigate performance of the GSC algorithm on the standard "house" benchmark for 



denoising which has been used for the evaluation of similar approaches [e.g., Li and Liu, 2009, Zhou 



et al. , 2009 including the MTMKL spike- and-slab approach. The MTMKL approach currently 



represents the state-of-the-art on this benchmark [see Titsias and Lazaro-Gredilla , 2011 . For the 
task an input noisy image is generated by adding Gaussian noise (with zero mean and standard 
deviation determining the noise level) to the 256 x 256 image (see Fig. [6]). Following the previous 
studies, we generated 62, 001 overlapping (shifted by 1 pixel) 8x8 patches from the noisy image. 
We then applied 65 iterations of the GSC algorithm for H G {64, 256} for different noise levels 
a G {15,25,50}. The truncation parameters H' and 7 for each run are listed in Tab.[2j We 
assumed homoscedastic observed noise with a priori unknown variance in all these experiments 
(as the MTMKL model). Tab.[2] compares the denoising results of different algorithms in terms 
of the peak signal-to-noise (PSNR) ratio. We found that for the low noise level (<r = 15) GSC is 
competitive with other approaches but with MTMKL performing slightly better. For the higher 
noise levels of a = 25 and a = 50, GSC outperforms all the other approaches including the MTMKL 
approach that represented the state-of-the-art. In Fig. [6] we show our result for noise level 25. The 
figure contains both the noisy and the GSC denoised image along with the inferred sparsity vector 
7r and all bases with appearance probabilities significantly larger than zero (sorted from high such 



Titsias and Lazaro-Gredilla 



2011 



report that for the denoising task they observe no performance improvements 



for larger number of latents. 
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Figure 5: Runtime vs. performance comparison of GSC and MTMKL on source separation and 
denoising tasks. Source separation is compared on the Amari index (the lower the better) while 
the denoising is compared on the peak signal-to-noise (PSNR) ratio (the higher the better). 



probabilities to low ones). We also tried GSC with higher numbers of latent dimensions. Although 
for low noise levels of a = 15 and a = 25 we did not notice significant improvements, we observed 
a further increase for a = 50. For instance, for H = 400, H' = 10 and 7 = 8, we obtained a PSNR 
of 28.48dB. 



As for source separation described in section 5.4, we also compared performance vs. computa- 
tional demand of both algorithms for the task of image denoising. As can be observed by considering 
Fig.[5j MTMKL performs better when computational resources relatively limited. However, when 
increasingly more computational resources are made available, MTMKL does not improve much 
further on its performance while GSC continuously increases and eventually outperforms MTMKL 
on this task. 



Noise 


PSNR (dB) 


Noisy img 


MTMKL 61 ?' 


K-SVD m4s ' * K _ S y D match 


Beta pr. 


GSC (H=64) GSC (H=256) 


CT=15 


24.59 


34.29 


30.67 34.22 


34.19 


32.68 (H'=io, 7 =8) 33.78 (H'=18, 7 =3) 


ct=25 


20.22 


31.88 


31.52 32.08 


31.89 


31.10 (H'=10,7=8) 32.01 (H' = 18,7=3) 


<r=50 


14.59 


28.08 


19.60 27.07 


27.85 


28.02 (H'=10,7=8) 28.35 (H' = 10,7=8) 



Table 2: Comparison of the GSC algorithm with other methods applied to the "house" benchmark. 
The compared methods are: MTMKL |Titsias and Lazaro-Gredilla 2011 , K-SVD |Li and Liu[ 



2009 , and the method by Zhou et al. 20091. Bold values highlight the best performing algorithm(s). 



*High values for K-SVD matched are not made bold-faced as the method assumes the noise variance 



to be known a-priori [see Li and Liu, 2009 . 
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Figure 6: Top left: Noisy "house" image with a = 25. Top right: GSC denoised image. Middle: 
Inferred sparsity values tt^ in descending order indicate that finally around 107 of in total 256 latent 
dimensions significantly contribute to model the data. Bottom: Basis functions (ordered from left 
to right, top to bottom) corresponding to the first 107 latent dimensions sorted w.r.t. the decreasing 
sparsity values ir^ . 
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6 Discussion 



The last years have seen a surge in the application of sparse coding algorithms to different research 
domains, along with developments of new sparse coding approaches with increased capabilities. 
There are currently different lines of research followed for developing new algorithms: One direction 



is based on the standard sparse coding algorithm Olshausen and Field 1996 with Laplace prior 
and parameter optimization using maximum a-posteriori (MAP) estimates of the latent posteriors 
for efficient training. This original approach has since been made more efficient and precise. Many 
sparse coding algorithms based on the MAP estimation are continuously being developed and 



successfully applied in a variety of settings [e.g., Lee et al. 2007, Mairal et al. 2009 . Another line 
of research aims at a fully Bayesian description of sparse coding and emphasizes greater flexibility 
by using different (possibly non-Gaussian) noise models and estimations of the number of hidden 
dimensions. The great challenge of these general models is the procedure of parameter estimation. 



For instance, the model in |Mohamed et al. 2012| uses Bayesian methodology involving conjugate 
priors and hyper-parameters in combination with Laplace approximation and different sampling 
schemes. 

A line of research falling in between conventional and fully Bayesian approaches is represented 
by the truncated variational approach studied here and by other very recent developments [Tit 



sias and Lazaro-Gredilla, 2011 Goodfellow et al. 2012 . While these approaches are all based on 



spike-and-slab generalizations of sparse coding as fully Bayesian approaches, they maintain deter- 
ministic approximation procedures for parameter optimization. Variational approximations allow 
for applications to large hidden spaces, posing a challenge for sampling approaches especially in 
cases of multi-modal posteriors. Using the novel and existing approaches in different experiments 
of this study, we have confirmed the advantages of spike-and-slab priors for sparse coding, and the 
scalability of variational approximations for such models. The newly developed truncated varia- 
tional algorithm scales almost linearly with the number of hidden dimensions for fixed truncation 
parameters (see for instance the scaling behavior in supplemental Fig. [8] for H going up to 1024). 
The MTMKL algorithm by |Titsias and Lazaro-Gredilla| pOlT] has been applied on the same scale. 
Using a similar approach also based on factored distributions Goodfellow et al. |2012 report results 
for up to a couple of thousands latent dimensions (albeit on small input dimensions and having 
a more constrained generative model). Sampling based algorithms for non-parametric and fully 
Bayesian approaches are more general but have not been applied to such large scales. 

A main focus of this work and reasoning behind the algorithm's development is due to the long- 



known biases introduced by factored variational approximations Ilin and Valpola, 2003, MacKay 
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2001, Turner and Sahani, 2011 . Our systematic comparison of the GSC algorithm to the method 



by Titsias and Lazaro-Gredilla [201 1 confirms the earlier observation [Ilin and Valpola" 2003 that 
factored variational approaches are biased towards orthogonal bases. If we compare the perfor- 
mance of both algorithms on the recovery of non-orthogonal sparse directions, the performance 
of the factored variational approach is consistently lower than the performance of the truncated 
variational algorithm (Fig. [2]). The same applies for experiments for unmixing real signals in which 



we increased the non-orthogonality (Fig.[3]^,C; suppl. Fig. 10); although for some data performance 
is very similar (Fig.[3^>). Also if sources are mixed orthogonally, we usually observe better perfor- 
mance of the truncated variational approach (Tab. [I]), which is presumably due to the more general 
underlying prior (i.e., a fully parameterized Gaussian slab). Overall, GSC is the best performing 



algorithm on source separation tasks involving non-orthogonal sparse directions [compare Suzuki 



and Sugiyama, 2011, for algorithms constrained to orthogonal bases]. For some datasets with few 



data points, we observed an equal or better performance of the MTMKL approach, which can be 
explained by their Bayesian treatment of the model parameters (see Tab.[TJ performance with 200 
data points). Notably, both approaches are consistently better on source separation benchmarks 
than the standard sparse coding approaches SPAMS Mairal et al. , 2009 and ESCA |Lee et aL 



20071 (see Tab.[l]). This may be taken as evidence for the better suitability of a spike-and-slab prior 



for such types of data. 

Finally, we compared the performance of factored and truncated variational approximations on 
a standard denoising task (see Tab.[2|. The high PSNR values observed for both approaches again 
in general speak for the strengths of spike-and-slab sparse coding. The MTMKL model represented 
the state-of-the-art on this benchmark, so far. Differences of MTMKL to previous approaches are 
small, but this is due to the nature of such long-standing benchmarks (compare, e.g., the MNIST 
data set). For the same denoising task with standard noise levels of a = 25 and a = 50 we found the 
GSC model to further improve the state-of-the-art (compare Tab. [2] with data by Li and Liu (2009 



Zhou et al.| [2009] , |Titsias and Lazaro-Gredilla| [2011] ). While we observed a continuous increase of 



performance with the number of hidden dimensions used for GSC, the MTMKL algorithm |Titsias 



and Lazaro-Gredilla, 2011] is reported to reach saturation at H = 64 latent dimensions. As the 



learned sparse directions become less and less orthogonal the more overcomplete the setting gets, 
this saturation may again be due to the bias introduced by the factored approach. GSC with 
H = 256 improves the state-of-the-art with 32.01dB for a = 25 and with 28.35dB for a = 50 (with 
even higher PSNR for H = 400). Because of assuming an independent Bernoulli prior per latent 
dimension, GSC can also prune out latent dimensions by inferring very low values of 7r^ for the 
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bases that make negligible contribution in the inference procedure. This can be observed in Fig.|6j 
where for the application of GSC to the denoising task with a = 25, we found only about 107 
of the 256 basis functions to have significant probabilities to contribute to the task. This means 
that GSC with about 100 basis functions can be expected to achieve almost the same performance 
as GSC with 256 basis functions. However, in practice we observed that the average performance 
increases with more basis functions because local optima can more efficiently be avoided. Note that 
this observation is not limited to the particular approach studied here; also for other approaches 
to sparse learning, efficient avoidance of local optima has been reported if the number of assumed 



hidden dimensions was increased [e.g. Spratling, 2006, Liicke and Sahani 2008| . In comparison 
to MTMKL, GSC can make use of significantly more basis functions. It uses about 100 functions 
while MTMKL performance saturates at about 64 as mentioned previously. On the other hand, 
we found MTMKL to perform better on the low noise level setting (see a = 15 in Tab. [2]) or when 
relatively limited computational resources are available (see Fig. [5]). 

In conclusion, we have studied a novel learning algorithm for sparse coding with spike-and-slab 
prior and compared it with a number of sparse coding approaches including other spike-and-slab 
based methods. The results we obtained show that the truncated variational approach is a com- 
petitive method. It shows that posterior dependencies and multi-modality can be captured by a 
scalable variational approach. Furthermore, the direct comparison with a factored approach in 
source separation experiments also confirms earlier observations that assumptions of a-posteriori 
independence introduces biases, and that avoiding such biases, e.g. by a truncated approach, im- 
proves the state-of-the-art on source separation benchmarks as well as on standard denoising tasks. 
However, we also find that under certain constraints and settings, factored variational learning for 
spike-and-slab sparse coding may perform as good or better. In general, our results argue in favor 
of spike-and-slab sparse coding models and recent efforts for developing improved algorithms for 
inference in such models. 
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Appendix 

We give details on derivations and numerical experiments. 



A Derivation of M-step Equations 



Our goal is to optimize the free-energy w.r.t. 0: 
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The free-energy thus takes the form: 
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where q n (s, z; old ) denotes the posterior p(s, z \ y^ n \ old ). Now we can derive the M-step equations 
to ( |11[ ) by canonically setting the derivatives of the free-energy above w.r.t. each parameter in 
to zero. 
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Optimization of the Data Noise 

Let us start with the derivation of the M-step equation for S: 
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where ( • } n denotes the expectation value in Eqn.rol 



Optimization of the Bases 

We will now derive the M-step update for the basis functions W: 
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Optimization of the Sparsity Parameter 

Here we take the derivative of the free-energy w.r.t. n: 
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Optimization of the Latent Mean 

Now we derive the M-step update for the mean jl of the Gaussian slab: 



N 



- n *) T (* ( - n * ) 



EE q n (s,z;@° l < 
»=1 s £ 

N r 

EE Qn(s,z;Q old ) {^Qs^y 1 ^- fl)Qs) dz = 

n=l s - L 



d£ 



En=l (*) n 



Optimization of the Latent Covariance 

Lastly we derive the M-step update for the latent covariance \&: 
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B Performance vs. Complexity Trade- Off 

If the approximation parameters H' and 7 are held constant, the computational cost of the algo- 
rithm scales with the computational cost of the selection function. If the latter cost scales linearly 
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Figure 7: Performance of the GSC on lOhalo, EEG19 and Speech20 benchmarks for decreasing trun- 
cation parameters H' and 7. The right panel shows how the computational demand of the truncated 
variational algorithm decreases with the values of the truncation parameters. The runtime plots 
are normalized by the runtime value obtained for H' = 7 = 10 for each of the benchmarks. 



with H (as is the case here), then so does the overall computational complexity [compare com- 



plexity considerations by Liicke and Eggert , 2010] ) . This is consistent with numerical experiments 
in which we measured the increase in computational demand (see Fig. [8]). In experiments with H 
increasing from 16 to 1024, we observed a, finally, close to linear increase of computational costs. 
Note, however, that a larger H implies a larger number of parameters, and thus may require more 
data points to prevent overfitting. Although a larger dataset increases computational demand, our 
truncated approximation algorithm allows us to take advantage of parallel computing architecture 
in order to more efficiently deal with large datasets (see Appendix [C] for details). Therefore in 
practice, we can weaken the extent of an increase in computational cost due to a higher demand 
for data. Furthermore, we examined the benefit of using GSC (in terms of average speedup over 
EM iterations) versus the cost regarding algorithmic performance. We compared approximation 
parameters in the range of H' = 7 = [1, 10] and again observed the performance of the algorithm 
on the task of source separation (with randomly generated orthogonal ground truth mixing bases 
and no observed noise). FigjT] shows that a high accuracy can still be achieved for relatively small 
values of H' 7 which, at the same time, results in strongly reduced computational demands. 



C Dynamic Data Repartitioning for Batch/Parallel Processing 

We have seen in section [4] that the truncated variational approach discriminatively selects the 
most relevant causes of an observation y in order to approximate the posterior distribution over 
the latent space. In practice we can further benefit from the latents' preselection by making us 



32 



L«--» 1 1 1 U 

32 64 128 256 512 768 1024 

H 

Figure 8: Time scaling behavior of GSC for increasing latent dimensions H and fixed truncation 
parameters H' and 7. 



of it for clustering the observed data for batch processing, which in turn allows for an efficient 
parallel implementation of the truncated EM algorithm. Prior to each E-step, we can partition the 
data by clustering it based on the most relevant causes chosen by the preselection step for each 
y. The resulting clusters can then be distributed among multiple compute cores to perform the 



E-step (Eqns. 25 to 27) in parallel. This approach not only pursues a natural partitioning of data, 
but in a parallel setting, it can prove to be more efficient than a uniform distribution of data as in 
|Bornschein et alT 2010 . By maximizing the similarity among data points assigned to an individual 



processing unit, we can minimize across all the units redundant computations involved in Eqns. 19 



and 27, that are tied to specific states of the causes. Particularly, as shown in the main text, the 



truncated posterior (25) takes the following form: 



Here the parameters jig, Cg and Kg entirely depend on a state s of causes. Also, k- takes 



prefactors that can be precomputed given s. To compute ( p3| ) , it turns out that our dynamic data 

redistribution strategy is more efficient than a static (and uniform) data distribution approach. 

This is illustrated in Fig. [9j which shows empirical E-step speedup over the static data distribution 

strategy taken as a baseline. The error bars were generated by performing 15 trials per given data 

size N. For all the trials, model scale (i.e., data dimensionality) and truncation approximation 

parameters were kept constant^] Each trial was run in parallel on 24 computing nodes. The red 
q 

The observed and the latent dimensions of the GSC model were 25 and 20 respectively. The truncation approxi- 
mation parameters H' and 7 (maximum number of active causes in a given latent state) were 8 and 5 respectively. 
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Runtime speedup over static data distribution by means of 
local vs. global dynamic data repartitioning 
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Figure 9: Runtime speedup of the truncated variational E-step (Eqns. 25 to 27) with the static data 



distribution strategy taken as a baseline. The red plot shows the speedup when initially uniformly 
distributed data samples were only clustered locally by each processing unit, while the blue plot 
shows the speedup as a result of globally clustering and redistributing the data. The runtimes 
include the time taken by clustering and repartitioning modules. 

plot in the figure also shows the speedup as a result of an intermediate approach. There we initially 
uniformly distributed the data samples which were then only locally clustered by each processing 
unit at every E-step. The blue plot on the other hand shows the speedup as a result of globaly 
clustering and redistributing the data prior to an E-step. It is important to note that all reported 
results also take into account the cost of data clustering and repartitioning. 

We optimize the data clustering process by having each processing unit cluster its own data 
locally and then merging the resulting clusters globally. To avoid unfair workload distribution, we 
also bound the maximum cluster size. Currently we pick (per iteration) top a percentile of cluster 
sizes as the threshold. Any cluster larger than a is evenly broken into smaller clusters of maximum 
Moreover, to minimize communication overhead among computational units, we only 



a size 



10 



recluster and distribute indices of the data points. This entails that the actual data must reside in 
a shared memory structure which is efficiently and dynamically accessible by all the computational 
units. Otherwise, all the units require their own copy of the whole dataset. 

It is important to note that although we have illustrated the gains of dynamic data repartition- 
ing using a specific sparse coding model which typically involves state-dependent computationally 
expensive operations, the technique itself is inherently generic as it is based on a variational EM 



approach which can be applied to a larger range of models and problems [see Liicke and Eggert 
20101 Sec 4]. 



l^The a for the reported experiments was 5. 
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Figure 10: Source separation with observation noise. Performance of GSC vs. MTMKL on lOhalo 
and EEG19 benchmarks with varying degrees of orthogonality of the mixing bases and Gaussian 
noise added to observations. Performance of GSC vs. MTMKL on the Speech20 benchmark with 
varying degrees of orthogonality of the mixing bases with Gaussian noise added to observed data. 
The orthogonality on the x-axis varies from being orthogonal _L to increasingly non-orthogonal 
mixing as radomly generated orthogonal bases are perturbed by adding Gaussian noise M(0, a) to 



them. Performance is compared on the Amari index (32) 
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